ABSTRACT: Understanding how monomeric proteins fold under in vitro conditions is crucial
to describing their functions in the cellular context. Significant advances in both theory
and experiments have resulted in a conceptual framework for describing the folding mechanisms
of globular proteins. The experimental data and theoretical methods have revealed the
multifaceted character of proteins. Proteins exhibit universal features that can be determined
using only the number of amino acid residues (N) and polymer concepts. The sizes of proteins in
the denatured and folded states, cooperativity of the folding transition, dispersions in the
melting temperatures at the residue level, and time scales of folding are to a large extent
determined by N. The consequences of finite N, especially on how individual residues order upon
folding, depend on the topology of the folded states. Such intricate details can be predicted
using the Molecular Transfer Model that combines simulations with measured transfer free
energies of protein building blocks from water to the desired concentration of the denaturant.
By watching one molecule fold at a time, using single molecule methods, the validity of the
theoretically anticipated heterogeneity in the folding routes and the N-dependent time scales
for the three stages in the approach to the native state have been established. Despite the
successes of theory, of which only a few examples are documented here, we conclude that much
remains to be done to solve the "protein folding problem" in the broadest sense.
1 INTRODUCTION

The quest to solve the protein folding problem in quantitative detail, which is
surely only the first step in describing the functions of proteins in the cellular
context, has led to great advances on both experimental and theoretical fronts
(85, 92, 38, 144, 130, 125, 127, 30, 105, 28, 117, 6, 114, 35, 141, 5, 26, 27, 8, 43). In the
process our vision of the scope of the protein folding problem has greatly expanded.
The determination of protein structures by X-ray crystallography (70) and the
demonstration that proteins can be reversibly folded following denaturation (3)
ushered in two research fields. The first is the prediction of the three dimensional
structures given the amino acid sequence (11, 97), and the second is the description of the
folding kinetics (114, 106, 125). Another line of inquiry in the protein folding field
opened with the discovery that certain proteins require molecular chaperones to
reach the folded state (46, 140, 58, 128). More recently, the realization that protein
misfolding is linked to a number of diseases has added further wrinkles
to the already complicated protein folding problem (21, 116, 126, 36). Although
known for a long time, it is becoming more widely appreciated that the restrictions in
the conformational space in the tight cellular compartments might have a significant
effect on all biological processes including protein folding (20, 142). In all these
situations the protein folding problem is at center stage. The solution to this
problem requires a variety of experimental, theoretical and computational tools.
Advances on all these fronts have given us hope that many aspects of perhaps the
"simplest" of the protein folding problems, namely, how single domain globular
proteins navigate the large dimensional, potentially rugged free energy surface en
route to the native structure, are under theoretical control.

Much of our understanding of the folding mechanisms comes from studies of
proteins that are described using the two-state approximation in which only the
unfolded and folded states are thought to be significantly populated. However,
proteins are finite size branched polymers in which the native structure is only
marginally stabilized by a number of relatively weak ($\sim O(k_B T)$) interactions.
From a microscopic point of view the unfolded state and even the folded state
should be viewed as an ensemble of structures. Of course, under folding conditions
the fluctuations in the native state are less than in the unfolded state. In this picture,
rather than viewing protein folding as a unimolecular reaction ($U \leftrightarrow F$, where
U and F are the unfolded and folded states respectively), one should think of
the folding process as the interconversion of the conformations in the Denatured
State Ensemble (DSE) to the ensemble of structures in the Native Basin of Attraction
(NBA). The description of the folding process in terms of distribution
functions necessarily means that appropriate tools in statistical mechanics together
with concepts in polymer physics (23, 42, 31, 49) are required to understand the
self-organization of proteins, and for that matter RNA (125).

Here, we provide theoretical perspectives on the thermodynamics and kinetics of
protein folding of small single domain proteins with an eye towards understanding
and anticipating the results of single molecule experiments. The outcomes of these
experiments are ideally suited to reveal the description based on changes
in distribution functions that characterize the conformations of proteins as the
external conditions are varied. Other complementary theoretical viewpoints on
the folding of single domain proteins have been described by several researchers
(106, 117, 28, 118, 120).



2 UNIVERSALITY IN PROTEIN FOLDING THERMODYNAMICS

The natural variables that should control the generic behavior of protein folding
are the length (N) of the protein, topology of the native structure (5), symmetry
of the native state (79, 135), and the characteristic temperatures that give rise to
the distinct "phases" that a protein adopts as the external conditions (such as
temperature T or denaturant concentration ([C])) are altered (123). In terms of these
variables, several universal features of the folding process can be derived, which
shows that certain aspects of protein folding can be understood using concepts
developed in polymer physics (23, 42, 31, 49).

2.1 Protein Size Depends on Length 

Under strongly denaturing conditions proteins ought to exhibit random coil
characteristics. If this were the case, then based on Flory theory (42), the radius
of gyration ($R_g$) of proteins in the unfolded state must scale as $R_g \sim a_D N^{\nu}$,
where $a_D$ is a characteristic Kuhn length, N is the number of amino acid residues,
and $\nu \approx 0.6$. Analysis of experimental data indeed confirms the Flory prediction
(Fig. 1a) (80), which holds for homopolymers in good solvents. Because folded
proteins are maximally compact, the native states should obey $R_g \sim a_N N^{\nu}$ with
$\nu = 1/3$. Explicit calculations of $R_g$ for a large number of proteins in the Protein
Data Bank (PDB) show that the expected scaling is obeyed for the folded states
as well (Fig. 1b) (29).
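
As a rough numerical illustration of the two scaling laws above, the following sketch evaluates $R_g$ for a few chain lengths. Only the exponents ($\nu \approx 0.6$ and $\nu = 1/3$) come from the discussion; the prefactors `A_DENATURED` and `A_NATIVE` are hypothetical values in angstroms chosen for illustration.

```python
def radius_of_gyration(n_residues, prefactor, nu):
    """Return R_g = prefactor * N**nu, in the length unit of the prefactor."""
    if n_residues <= 0:
        raise ValueError("n_residues must be positive")
    return prefactor * n_residues ** nu

# Hypothetical prefactors in angstroms; the text specifies only the exponents.
A_DENATURED = 2.0
A_NATIVE = 3.0

def rg_denatured(n_residues):
    """Flory scaling for the unfolded state (nu ~ 0.6)."""
    return radius_of_gyration(n_residues, A_DENATURED, 0.6)

def rg_native(n_residues):
    """Maximally compact scaling for the folded state (nu = 1/3)."""
    return radius_of_gyration(n_residues, A_NATIVE, 1.0 / 3.0)

# Print N, R_g (denatured), R_g (native) for a few chain lengths.
for n in (50, 100, 200, 400):
    print(n, round(rg_denatured(n), 1), round(rg_native(n), 1))
```

Doubling N increases the denatured-state size by a factor of $2^{0.6}$ but the native-state size only by $2^{1/3}$, which is why the two ensembles separate more clearly as N grows.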



2.2 Characteristic Phases 

Proteins are finite-sized systems that undergo phase changes as the quality of solvent
is decreased. As T ([C]) is lowered to the collapse temperature $T_\theta$ ($[C]_\theta$), which
decreases the solvent quality, a transition from an expanded ensemble to an ensemble of
compact structures must take place. The collapse transition can be either first or
second order (23), depending on the nature of the solvent-mediated interactions. In
a protein there are additional energy scales that render a few of the exponentially
large number of conformations lower in free energy than the rest. These minimum
energy compact structures (MECS) direct the folding process (17). When the
temperature is lowered to the folding transition temperature $T_F$, a transition to
the folded native structure takes place. These general arguments suggest that
there are minimally three phases for a protein as T or [C] is varied. They are the
unfolded (U) states, an ensemble of intermediate (I) structurally heterogeneous
compact states, and the native state.

An order parameter that distinguishes the U and I states is the monomer density,
$\rho = N/R_g^3$. It follows from the differences in the size dependence of $R_g$ in the U
and I states with N (Fig. 1) that $\rho \approx 0$ in the U phase while $\rho \approx O(1)$ in the I and
the NBA. The structural overlap function ($\chi$), which measures the similarity to the
native structure, is necessary to differentiate between the I and the conformations
in the NBA. The collapse temperature may be estimated from the changes in the
$R_g$ values of the unfolded state as T is lowered, while $T_F$ may be calculated from
$\Delta\chi = \langle \chi^2 \rangle - \langle \chi \rangle^2$, the fluctuations in $\chi$.



Thirumalai et. al. 

2.3 Scaling of Folding Cooperativity with N is Universal

A hallmark of the folding transition of small single domain proteins is that it is
remarkably cooperative (Fig. 2). The marginal stability criterion can be used to
infer the N-dependent growth of a dimensionless measure of cooperativity
$\Omega_c = \frac{T_F^2}{\Delta T} \left| \frac{d f_{NBA}}{dT} \right|_{T=T_F}$ (74), where $\Delta T$ is the full width at half maximum of
$|d f_{NBA}/dT|$, in a way that reflects both the finite size of proteins and the global
characteristics of the denatured states.

The dependence of $\Omega_c$ on N is derived using the following arguments (88). (i)
$\Delta\chi$ is analogous to susceptibility in magnetic systems and hence can be written as
$\Delta\chi = T |d\langle\chi\rangle/dh|$, where h is an ordering field conjugate to $\chi$. Because $\Delta\chi$ is
dimensionless, we expect that the ordering field $h \sim T$. Thus, $T |d\langle\chi\rangle/dT| \sim T |d f_{NBA}/dT|$
plays the role of susceptibility in magnetic systems. (ii) Efficient folding in apparent
two-state folders implies $T_F \approx T_\theta$ (16) (or equivalently $C_\theta \approx C_F$ (74) when
folding is triggered by denaturants). Therefore, the critical exponents that control
the behavior of the polypeptide chain at $T_\theta$ must control the thermodynamics of
the folding phase transition. At $T \approx T_\theta \approx T_F$ the Flory radius $R_g \sim \Delta T^{-\nu} \sim N^{\nu}$.
Thus, $\Delta T \sim N^{-1}$ (Fig. 2b). Because of the analogy to magnetic susceptibility, we
expect $T |d\langle\chi\rangle/dT| \sim N^{\gamma}$. Using these results we obtain $\Omega_c \approx N^{\zeta}$ where $\zeta = 1 + \gamma$,
which follows from the hypothesis that $T_F \approx T_\theta$. The fifth order $\epsilon$ expansion for
polymers using n-component field theory with $n \to 0$ gives $\gamma = 1.22$, giving $\zeta = 2.22$
(72).



The linear fit to the log-log plot of the dependence of $\Omega_c$ on N for proteins shows
that $\zeta = 2.17 \pm 0.09$ (Fig. 2c). The remarkable finding that expresses
cooperativity in terms of N and $\zeta$ gives further credence to the proposal that
efficient folding is achieved if sequences are poised to have $T_F \approx T_\theta$ (16, 73).
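
To make the definition of the cooperativity measure concrete, the sketch below estimates $T_F$, $\Delta T$ and $\Omega_c$ numerically for a toy two-state (van't Hoff) model of $f_{NBA}(T)$. The model, the temperature-independent unfolding enthalpy `delta_h`, and all numerical values are assumptions made for illustration; they are not the data analyzed in the text. $T_F$ is taken as the location of the peak of $|d f_{NBA}/dT|$.

```python
import math

R_GAS = 8.314e-3  # gas constant in kJ/(mol K)

def f_nba(temp, t_fold, delta_h):
    """Native fraction of a two-state (van't Hoff) model.

    delta_h is the unfolding enthalpy in kJ/mol, assumed to be
    temperature independent in this sketch.
    """
    return 1.0 / (1.0 + math.exp((delta_h / R_GAS) * (1.0 / t_fold - 1.0 / temp)))

def cooperativity(t_fold, delta_h, t_min=250.0, t_max=450.0, step=0.01):
    """Return (T_F, Delta T, Omega_c) estimated on a temperature grid."""
    n = int(round((t_max - t_min) / step))
    temps = [t_min + i * step for i in range(n + 1)]
    frac = [f_nba(t, t_fold, delta_h) for t in temps]
    # |df_NBA/dT| by central differences on the interior grid points
    grid = temps[1:-1]
    deriv = [abs(frac[i + 1] - frac[i - 1]) / (2.0 * step) for i in range(1, n)]
    peak = max(range(len(deriv)), key=lambda i: deriv[i])
    half = 0.5 * deriv[peak]
    # Walk outwards from the peak to locate the half-maximum points
    left = peak
    while left > 0 and deriv[left] > half:
        left -= 1
    right = peak
    while right < len(deriv) - 1 and deriv[right] > half:
        right += 1
    t_f = grid[peak]
    width = grid[right] - grid[left]
    omega_c = t_f ** 2 / width * deriv[peak]
    return t_f, width, omega_c

t_f, width, omega_c = cooperativity(330.0, 200.0)
print(round(t_f, 1), round(width, 1), round(omega_c))
```

In this toy model a sharper transition (larger `delta_h`) narrows $\Delta T$ and raises the peak of $|d f_{NBA}/dT|$ at the same time, so $\Omega_c$ grows roughly quadratically with the enthalpy.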
3 GENERAL PRINCIPLES THAT GOVERN FOLDING KINETICS

A few general conclusions about how proteins access the NBA may be drawn
by visualizing the folding process in terms of navigation of a large dimensional
folding landscape (Fig. 3a). Dynamics of random heteropolymers have shown that
their energy landscapes are far too rugged to be explored (12) on typical folding
times (on the order of milliseconds). Therefore, the energy landscape of many
evolved proteins must be smooth (or funnel-like (84, 106, 28)), i.e., the gradient of
the energy landscape towards the NBA is "large" enough that the biomolecule
does not pause in Competing Basins of Attraction (CBAs) for long times during
the folding process. Because of energetic and topological frustration the folding
landscapes of even highly evolved proteins are rugged on length scales smaller than
$R_g$ (130, 63). In the folded state, the hydrophobic residues are usually sequestered
in the interior while polar and charged residues are better accommodated on the
protein surface. Often these conflicting requirements cannot be simultaneously
satisfied and hence proteins can be energetically "frustrated" (50, 22). If the packing
of locally formed structures is in conflict with the global fold then the polypeptide
chain is topologically frustrated. Thus, the energy landscape is rugged on length
scales that are larger than those in which secondary structures ($\sim$ (1-2) nm) form
even if folding can be globally described using the two-state approximation.

There are several implications of the funnel-like and rugged landscapes for folding
kinetics (Fig. 3a). (i) Folding pathways are diverse. The precise folding trajectory
that a given molecule follows depends on the initial conformation and the location
in the landscape from which folding commences. (ii) If the scale of ruggedness is
small compared to $k_B T$ ($k_B$ is the Boltzmann constant) then trapping in CBAs for
long times is unlikely, and hence folding follows exponential kinetics. (iii) On the
other hand, if the space of CBAs is large then a substantial fraction of molecules
can be kinetically trapped in one or more of the CBAs. If the time scale of
interconversion between the conformations in the CBAs and the NBA is long
then the global folding would occur through well-populated intermediates.

3.1 Multiple Folding Nuclei (MFN) Model

Theoretical studies (50, 124, 13, 1) and some experiments (39, 65) suggest that
efficient folding of single domain proteins is consistent with a nucleation-collapse (NC)
mechanism according to which the rate limiting step involves the formation of one of the
folding nuclei. Because the formation of the folding nucleus and the collapse of the chain
are nearly synchronous, we referred to this process as the NC mechanism.

Simple theories have been proposed to estimate the free energy cost of producing
a structure that contains a critical number $N_R^*$ of residues whose formation drives the
structure to the native state (136, 50, 19). In the simple NC picture the barrier to
folding occurs because the formation of contacts (native or non-native) involving
the $N_R^*$ residues, while enthalpically favorable, is opposed by surface tension. In
addition, formation of non-native interactions in the transition state also creates
strain in the structures representing the critical nuclei. Using a version of the
nucleation theory and structure-based thermodynamic data, we showed that the average
size of the most probable nucleus $N_R^*$, for single domain proteins, is between 15 and 30
residues (19).

Simulations using lattice and off-lattice models established the validity of the MFN model
according to which certain contacts (mostly native) in the conformations in the Transition
State Ensemble (TSE) form with substantial probability (> 0.5). An illustration
(Fig. 3c) is given from a study of the lattice model with side chains (77) in
which the distribution of native contacts ($P_N(q)$) shows that about 45% of the total
number of native contacts have a high probability of forming in the TSE and none
of them form with unit probability. Although important (86), very few non-native
contacts have a high probability of forming at the transition state.

3.2 Kinetic Partitioning Mechanism (KPM) 

When the scale of roughness far exceeds $k_B T$, so that the folding landscape
partitions into a number of distinct CBAs that are separated from each other and
the NBA by discernible free energy barriers (Fig. 3a), folding is best described
by the KPM. A fraction $\Phi$ of molecules can reach the NBA rapidly (Fig. 3a). The
remaining fraction, $1 - \Phi$, is trapped in a manifold of discrete intermediates. Since
the transitions from the CBAs to the NBA involve partial unfolding, crossing of
the free energy barriers for this class of molecules is slow. The KPM explains
not only the folding of complex structured proteins but also counterion-induced
assembly of RNA, especially the Tetrahymena ribozyme (125). For RNA and large
proteins $\Phi \approx 0.05 - 0.2$ (125, 71, 107). The KPM is also the basis of the Iterative
Annealing Mechanism (132, 122).
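
The kinetic consequence of partitioning can be sketched with a biexponential decay of the unfolded population, in which a fraction $\Phi$ folds on a fast time scale and the remainder on a slow one. This functional form and the time constants `tau_fast` and `tau_slow` are illustrative assumptions; the text specifies only the range of $\Phi$.

```python
import math

def unfolded_fraction(t, phi, tau_fast, tau_slow):
    """Fraction of molecules that have not yet reached the NBA at time t.

    phi is the partition factor (fraction on the fast track); tau_fast and
    tau_slow are hypothetical time constants in the same unit as t.
    """
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must lie between 0 and 1")
    return phi * math.exp(-t / tau_fast) + (1.0 - phi) * math.exp(-t / tau_slow)

def mean_folding_time(phi, tau_fast, tau_slow):
    """Time integral of unfolded_fraction from 0 to infinity."""
    return phi * tau_fast + (1.0 - phi) * tau_slow

# Example: 10% fast-track molecules (1 ms) and 90% trapped molecules (1 s).
print(unfolded_fraction(0.01, 0.1, 1e-3, 1.0))
print(mean_folding_time(0.1, 1e-3, 1.0))
```

With a small $\Phi$ the mean folding time is dominated by escape from the CBAs, even though a minority of molecules fold orders of magnitude faster.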



3.3 Three Stage Multipathway Kinetics and the Role of N 

The time scales associated with distinct routes followed by the unfolded molecules
(Fig. 3) can be approximately estimated using N. For the case when $\Phi \approx 0$, the
folding time $\tau_F \sim \tau_0 N^{2+\theta}$ where $1.8 < \theta < 2.2$ (123). The theoretically
predicted power law dependence was validated in lattice model simulations in a
subsequent study (51).

Simulations using lattice and off-lattice models showed that the molecules that
follow the slow track reach the native state in three stages (Fig. 3b) (16, 50, 123).

Non-specific Collapse: In the first stage the polypeptide chain collapses to an
ensemble of compact conformations driven by the hydrophobic forces. The conformations
even at this stage might have fluctuating secondary and tertiary structures.
By adopting the kinetics of coil-globule formation in homopolymers it was shown that
the time scale for non-specific collapse is $\tau_{nc} \approx \tau_{c0} N^2$.

Kinetic Ordering: In the second phase the polypeptide chain effectively discriminates
between the exponentially large number of compact conformations to attain
a large fraction of native-like contacts. At the end of this stage the molecule finds
one of the basins corresponding to the MECS. Using an analogy to reptation in
polymers we suggested that the time associated with this stage is $\tau_{KO} \sim \tau_{KO,0} N^3$
(17).



All or None: The final stage of folding corresponds to activated transitions from
one of the MECS to the native state. A detailed analysis of several independent
trajectories for both lattice and off-lattice simulations suggests that there are multiple
pathways that lead to the structures found at the end of the second stage. There
are relatively few paths that connect the native state and the numerous native-like
conformations located at the end of the second stage (Fig. 3b).

In the majority of ensemble experiments only the third folding stage is measured.
The folding time is $\tau_F \approx \tau_0 \exp(\Delta F^{\ddagger}/k_B T)$, where the barrier height $\Delta F^{\ddagger} \approx \sqrt{N}$.
Others have argued that $\Delta F^{\ddagger} \sim N^{2/3}$ (136, 40). The limited range of N for which
data are available makes it difficult to determine the exponent unambiguously.
However, correlation of the stability of the folded states (123) expressed as Z-score
($\propto \sqrt{N}$) with folding time (75) shows that $\sqrt{N}$ scaling (98, 2) is generic (Fig. 3d).
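
The difficulty of discriminating between the two proposed barrier scalings over a narrow range of N can be seen in a small numerical sketch. It assumes a barrier of exactly $N^{\beta}$ in units of $k_B T$ (unit prefactor) and a hypothetical prefactor `tau0` of one microsecond; neither value is given in the text.

```python
import math

def folding_time(n_residues, exponent=0.5, tau0=1e-6):
    """tau_F = tau0 * exp(N**exponent), with the barrier in units of k_B T.

    tau0 (seconds) and the unit prefactor of the barrier are assumptions.
    """
    return tau0 * math.exp(n_residues ** exponent)

# Compare sqrt(N) and N^(2/3) barriers for typical single domain sizes.
for n in (50, 75, 100, 150):
    print(n, folding_time(n, 0.5), folding_time(n, 2.0 / 3.0))
```

In practice the prefactor of the barrier is a fit parameter, so over a limited range of N the two exponents can produce similar trends, which is consistent with the caution expressed above.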



4 MOVING FORWARD: NEW DEVELOPMENTS

Theoretical frameworks and simulations, especially those using a variety of coarse-grained
models (55, 22, 47, 48, 101, 69, 68, 57, 1), have been instrumental in making testable
predictions for the folding of a number of proteins. For example, by combining
structural analyses of a number of SH3 domains using polymer theory and off-lattice
simulations, we showed that the stiffness of the distal loop is the reason for the
observation of a polarized transition state in src SH3 and α-spectrin SH3 (78). The
theoretical prediction was subsequently validated by Serrano and coworkers (121).
This and other successful applications that combine simulations and experiments
legitimately show that, from a broad perspective, how proteins fold is no longer as
daunting a problem as it once seemed.

On the experimental front impressive advances, especially using single molecule 
FRET (smFRET) ([l0^[TT9j[T09}[54J[Tl0}[l4}[82}[90}|Tl5]) and Single Molecule Force 



spectroscopy (SMFS) (139,45,37,24) pose new challenges that demand more quan- 



titative predictions. Although still in their infancy, single molecule experiments 
have established the need to describe folding in terms of shifts in the distribution 
functions of the properties of the proteins as the conditions are changed, rather 
than using the more traditional well-defined pathway approach. New models that 
not only make precise connections to experiments but also produce far reaching 
predictions are needed in order to take the next leap in the theory of protein fold- 
ing. 

5 MOLECULAR TRANSFER MODEL (MTM): CONNECTING THEORY AND EXPERIMENT

Almost all of the computational studies to date have been done using temperature
to trigger folding and unfolding, while protein stability and kinetics in a majority
of the experiments are probed using chemical denaturants. A substantial conceptual
advance to narrow the gap between experiments and computations was made
with the introduction of the MTM theory (102, 103). The goal of the MTM is to
combine simulations at condition A with appropriate reweighting of the protein
conformational ensemble such that the behavior of the protein under solution
condition $B$ ($= \{T_B, \mathrm{pH}_B, [C_B]\}$) can be accurately predicted without running additional
simulations at B. By using the partition function $Z(A) = \sum_i e^{-\beta_A E_i(A)}$ in
condition A ($\beta_A = (k_B T_A)^{-1}$ and $E_i(A)$ is the potential energy of the $i$th microstate),
and the free energy cost of transferring $i$ from A to B (denoted $G_{tr,i}(A \to B)$),
the partition function $Z(B) = \sum_i e^{-\beta_B (E_i(A) + G_{tr,i}(A \to B))}$ in condition B can be
calculated (Fig. 4a).
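
The reweighting idea can be sketched as follows. If conformations are sampled at condition A with Boltzmann weight $e^{-\beta_A E_i(A)}$, then the ratio of the weights in $Z(B)$ and $Z(A)$ implies that each sampled conformation contributes to averages at B with weight $\exp[-(\beta_B - \beta_A) E_i(A) - \beta_B G_{tr,i}(A \to B)]$. The function name `reweighted_average`, the units (kcal/mol) and the toy inputs below are illustrative assumptions, not the implementation used in the cited work.

```python
import math

K_B = 0.0019872  # Boltzmann constant in kcal/(mol K)

def reweighted_average(observable, energies, g_transfer, temp_a, temp_b):
    """Estimate <observable> at condition B from conformations sampled at A.

    observable, energies and g_transfer are equal-length lists holding, for
    each sampled conformation, the property of interest, E_i(A), and the
    transfer free energy G_tr,i(A -> B) (energies in kcal/mol).
    """
    beta_a = 1.0 / (K_B * temp_a)
    beta_b = 1.0 / (K_B * temp_b)
    log_w = [-(beta_b - beta_a) * e - beta_b * g
             for e, g in zip(energies, g_transfer)]
    shift = max(log_w)  # subtract the maximum for numerical stability
    weights = [math.exp(lw - shift) for lw in log_w]
    total = sum(weights)
    return sum(w * o for w, o in zip(weights, observable)) / total

# Toy example: two conformations with equal energy; transfer to B penalizes
# the second one by k_B T ln 2, so its weight is halved.
penalty = K_B * 300.0 * math.log(2.0)
print(reweighted_average([0.0, 1.0], [0.0, 0.0], [0.0, penalty], 300.0, 300.0))
```

When all transfer free energies vanish and the temperatures are equal, every weight is one and the estimate reduces to the plain sample average, which is a convenient sanity check.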

5.1 Applications to Protein L and Cold Shock Protein 

In the applications of MTM theory to date we have used the $C_\alpha$-side chain model
($C_\alpha$-SCM) for proteins so that accurate calculation of $Z(A)$ can be made. The
phenomenological Transfer Model (10), which accurately predicts m-values for a
large number of proteins (Fig. 4b), is used to compute $G_{tr,i}(A \to B)$ for each protein
conformation using the measured [C]-dependent transfer free energies of amino acid side
chains and backbone from water to a [C]-molar solution of denaturant or osmolyte.
The success of the MTM is evident by comparing the results of simulations with
the GdmCl-dependent changes in $f_{NBA}$ and FRET efficiency ($\langle E \rangle$) for protein L
and the cold shock protein CspTm (Fig. 4c and 4d). Notwithstanding the discrepancies
among different experiments, the predictions of $\langle E \rangle$ as a function of GdmCl
concentration are in excellent agreement with experiments (Fig. 4d). The calculations
in Fig. 4 are the first to show that quantitative agreement between theory and
experiment can be obtained, thus setting the stage for extracting [C]-dependent
structural changes that occur during the folding process.

5.2 Characterization of the Denatured State Ensemble 

How does the DSE change as [C] decreases? A complete picture of the folding process
requires knowledge of the distributions of various properties of interest, namely,
secondary and tertiary structure contents and the end-to-end distance $R_{ee}$, as [C]
changes. The MTM simulations reveal a number of surprising results regarding
the DSE properties of globular proteins in general and protein L and CspTm in
particular. (i) Certain properties ($R_g$ for example) may indicate that high denaturant
concentration is a good solvent for proteins (Fig. 1a) while others give a more
nuanced picture of the DSE properties (103). If high [C] is a good solvent then
from polymer theory it can be shown that the end-to-end distribution function
$P_T(x) \sim x^{\delta} \exp(-x^{1/(1-\nu)})$, where $x = R_{ee}/\langle R_{ee} \rangle$ ($\langle R_{ee} \rangle$ is the average end-to-end
distance), should be universal with the exponent $\delta \approx 0.3$ in three dimensions.
Although the scaling $R_g \sim N^{\nu}$ of the DSE with $\nu \approx 0.6$ (Fig. 1a) suggests that the
DSE can be pictured as a random coil, the simulated $P(x)$ for protein L deviates
from $P_T(x)$, which shows that even at high GdmCl concentration remnants of structure must
persist (Fig. 5a). (ii) An important finding in smFRET experiments is that the
statistical characteristics of the DSE change substantially for [C] < $C_m$, the midpoint
concentration at which the populations of the unfolded and folded structures
are equal. For a number of proteins, including protein L and CspTm, there is a
collapse transition predicted theoretically (Fig. 5b) and demonstrated in smFRET experiments
(114, 119). For [C] >> $C_m$ only moderate changes in $R_g$ are observed while larger
changes occur for [C] < $C_m$ (Fig. 5b). Concomitant with the equilibrium collapse,
the fraction of residual structure increases, with the largest increase occurring below
$C_m$ (103). Thus, the DSE becomes compact and native-like as [C] decreases,
which shows that the collapse process should be a generic step during the folding
process (Fig. 5b).



5.3 Constancy of m-values and Protein Collapse 

A number of the smFRET experiments show that the DSE undergoes a continuous
collapse as [C] decreases (143), which implies that the accessible surface area must
also change with decreasing denaturant concentration. These observations would
suggest that the stability of the native state must be a non-linear function of [C]
even when [C] > $C_m$, which contradicts a large number of measurements showing
that the free energy changes linearly with [C]. The apparent contradiction was
addressed using simulations and theory, both of which emphasize the polymer nature
of proteins (102, 143). Explicit simulations of protein L showed that the constancy
of the m-value ($= d\Delta G_{ND}/d[C]$, where $\Delta G_{ND}$ is the stability of the NBA with respect
to the DSE) arises because the [C]-dependent surface area of the backbone, which makes the
largest contribution to m, does not change appreciably when [C] > $C_m$. In an
alternative approach to the Transfer Model (TM), Ziv and Haran (143) used polymer theory
and experimental data on 12 proteins and showed that the m-value can be expressed
in terms of a [C]-dependent interaction energy and the volume fraction of the protein
in the expanded state (Fig. 5f).
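
A constant m-value corresponds to the linear relation $\Delta G_{ND}([C]) = \Delta G_{ND}(0) + m[C]$, from which the native fraction and the midpoint $C_m = -\Delta G_{ND}(0)/m$ follow directly. The sketch below assumes the sign convention $\Delta G_{ND} = G_{NBA} - G_{DSE}$ (negative under folding conditions) and uses hypothetical values for the stability in water and for m.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol K)

def delta_g_nd(conc, dg_water, m_value):
    """Linear extrapolation: stability (kcal/mol) at denaturant concentration conc (M)."""
    return dg_water + m_value * conc

def native_fraction(conc, dg_water, m_value, temp=298.0):
    """Two-state native population implied by delta_g_nd."""
    return 1.0 / (1.0 + math.exp(delta_g_nd(conc, dg_water, m_value) / (R_KCAL * temp)))

def midpoint_concentration(dg_water, m_value):
    """Concentration C_m at which delta_g_nd vanishes."""
    return -dg_water / m_value

# Hypothetical protein: -5 kcal/mol stability in water, m = 2 kcal/(mol M).
c_m = midpoint_concentration(-5.0, 2.0)
print(c_m, native_fraction(c_m, -5.0, 2.0))
```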

The continuous nature of the collapse transition has also been unambiguously 
demonstrated in a series of studies by Udgaonkar and coworkers (67 133 p3) who 



have shown that the collapse process (both thermodynamically and kinetically) is 
a continuous process, and the description of folding as a two-state transition clearly 
obscures the hidden complexity. 

5.4 Transition Midpoints are Residue-Dependent 

The obsession with the two-state description of the folding transition as [C] (or T)
is changed, using only simple order parameters (see below), has led to molecular
explanations of the origin of cooperativity without examination of the consequences
of finite size effects. For instance, the van't Hoff criterion (coincidence of the
calorimetric enthalpy and the one extracted from fitting $f_{NBA}(T)$ to two states) and the
superposition of denaturation curves generated using various probes such as SAXS,
CD, and FRET are often used to assert that protein folding can be described using
only two states. However, these descriptions, which use only a limited set of order
parameters, are not adequate for fully describing the folding transition.

The order parameter theory for first and second order phase transitions is most
useful when the decrease in symmetry from a disordered to an ordered phase can
be described using simple physically transparent variables. For example, magnetization
and Fourier components of the density are appropriate order parameters for
spin systems (second order transition) and the liquid to solid transition (first order
transition) (108), respectively. In contrast, devising order parameters for complex
phase transitions (spin glass transition (105) or liquid to glass transition (129)) is
often difficult. A problem in using only simple order parameters in describing the
folding phase transition is that the decrease in symmetry in going from the unfolded
to the folded state cannot be unambiguously identified (see however (135, 79)). It
is likely that multiple order parameters are required to characterize protein structures,
which makes it difficult to assess the two-state nature of folding using only
a limited set of observables. Besides enthalpy and $R_g$, the extent of secondary and
tertiary structure formation as [C] is changed can also be appropriate order
parameters for monitoring the folding process. Thus, multiple order parameters are
needed to obtain a comprehensive view of the folding process.

The MTM simulations can be used to monitor the changes in the conformations
as [C] is changed using all of the order parameters described above. In particular,
the simulations can be used to calculate $C_{m,i}$, the transition midpoint at which the
$i$th residue is structured. For a strict two-state system $C_{m,i} = C_m$, the global
transition midpoint, for all $i$. However, several experiments on proteins that apparently
fold in a two-state manner show that this is not the case (112, 56). Deviations
of the melting temperatures of the individual residues from the global melting
temperature were first demonstrated by Holtzer for the 33-residue GCN4-LZK peptide
(56). In other words, the melting temperature is not unique but reflects the
distribution in the enthalpies as the protein folds. These pioneering studies have
been further corroborated by several recent experiments. Of particular note is the
study of thermal unfolding of the 40-residue BBL using two-dimensional NMR. The
melting profile, obtained using chemical shifts of 158 backbone and side-chain protons,
showed stunningly that the ordering temperatures are residue-dependent. The distribution of
the melting temperatures peaked at $T \approx 305$ K, which corresponds to the global
melting temperature. However, the dispersion in the melting temperature is nearly
40 K!

The variations in the melting of individual residues are also seen in the MTM
simulations involving denaturants. For protein L, the midpoints $C_{m,i}$ for denaturant
(urea) unfolding of individual residues are broadly distributed, with the global
unfolding occurring at $\approx 6.6$ M (Fig. 5e). The $C_{m,i}$ values for protein L depend not
only on the nature of the residues but also on the context in which the residue is
found. For example, the $C_{m,i}$ for Ala in the helical region of protein L is different
from that in a β-strand, which implies that not all alanines within the same protein
are structurally equivalent! Interestingly, the dispersion in melting temperature
(Fig. 5d) is less than in the $C_{m,i}$ values, which accords with the general notion
that thermal folding is more cooperative than denaturant-induced transitions. The
variations in the melting temperatures (or $C_{m,i}$), which are due to the finite size of
proteins, should decrease as N becomes large.
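
One simple way to extract residue-level midpoints from simulated or measured titration curves is to locate, for each residue, the concentration at which its folded fraction crosses 0.5, and then to report the spread of these values. The sketch below uses linear interpolation on made-up curves for three hypothetical residues; the residue names and numbers are not data from the text.

```python
def midpoint(concs, fractions, level=0.5):
    """Concentration where a residue's folded fraction first crosses `level`.

    Uses linear interpolation between grid points; returns None if the
    curve never crosses the level.
    """
    for k in range(len(concs) - 1):
        f0, f1 = fractions[k], fractions[k + 1]
        if f0 != f1 and (f0 - level) * (f1 - level) <= 0.0:
            return concs[k] + (level - f0) * (concs[k + 1] - concs[k]) / (f1 - f0)
    return None

def midpoint_dispersion(concs, curves):
    """Return per-residue midpoints and their spread (max minus min)."""
    mids = {res: midpoint(concs, f) for res, f in curves.items()}
    found = [m for m in mids.values() if m is not None]
    return mids, max(found) - min(found)

# Made-up folded fractions for three hypothetical residues.
concs = [0.0, 2.0, 4.0, 6.0, 8.0]
curves = {
    "res_10": [1.0, 0.9, 0.6, 0.4, 0.0],
    "res_25": [1.0, 0.8, 0.4, 0.1, 0.0],
    "res_40": [1.0, 1.0, 0.9, 0.7, 0.1],
}
mids, spread = midpoint_dispersion(concs, curves)
print(mids, spread)
```

For a strict two-state system all midpoints would coincide and the spread would vanish; a nonzero spread is the residue-level signature of finite-size effects discussed above.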

6 MECHANICAL FORCE TO PROBE FOLDING 

Single molecule force spectroscopy (SMFS), which directly probes the folding
dynamics in terms of the time-dependent changes in the extension $x(t)$, has altered
our perspective of folding by explicitly showing the heterogeneity in the folding
dynamics (45). While bulk experiments provide an understanding of gross
properties, single molecule experiments can give a much clearer picture of the folding
landscapes (63, 137, 138, 18), diversity of folding and unfolding routes (96, 107), and
the timescales of relaxation (61, 81). SMFS studies using mechanical force are
insightful because (i) mechanical force does not alter the interactions that stabilize
the folded states and conformations in the CBAs, (ii) the molecular extension $x$
that is conjugate to $f$ is a natural reaction coordinate, and (iii) they allow a direct
determination of $x$ as a function of $t$ from which equilibrium free energy profiles
and $f$-dependent kinetics can be inferred (61, 137, 107, 131). Interpretation and
predictions of the outcomes of SMFS experiments further illustrate the importance of
theoretical concepts from polymer physics (23, 31, 42, 49), stochastic theory (81, 53),
and hydrodynamics.

Initially SMFS experiments were performed by applying a constant loading rate $r_f$,
while more recently constant force is used to trigger folding. While f is usually applied
at the endpoints of the molecule of interest, other points may be chosen (24) to
more fully explore the folding landscape of the molecule. Despite the sequence-specific
architecture of the folded state, the FECs can be quantitatively described
using standard polymer models. The analyses of FECs using suitable polymer
models immediately provide the persistence length ($l_p$) and contour length (L)
of the proteins (15). Surprisingly, the FECs for a large number of proteins can
be analyzed using the Wormlike Chain (WLC) model, for which the equilibrium force as a
function of extension is (91) $l_p f/k_B T = x/L + 1/[4(1 - x/L)^2] - 1/4$, with L the
contour length of the chain and $l_p$ the persistence length, the characteristic length
scale of bending in the polymer. Disruption of internal structure, leading to rips in the
FEC, provides glimpses into the order of force-induced unfolding, provided the structure
of the folded state is known (134, 89, 104).
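
As a minimal numerical sketch of the WLC interpolation formula quoted above, the
following Python functions evaluate the force at a given extension and invert the
relation by bisection. The function names, the value $k_B T \approx 4.11$ pN nm (room
temperature), and the example parameters are illustrative assumptions and are not taken
from the text.

```python
import math

KBT_PN_NM = 4.11  # assumed k_B*T in pN*nm near room temperature


def wlc_force(x, contour_length, persistence_length, kbt=KBT_PN_NM):
    """Force (pN) at extension x (nm): l_p f / kBT = z + 1/(4(1-z)^2) - 1/4, z = x/L."""
    if not 0.0 <= x < contour_length:
        raise ValueError("extension must satisfy 0 <= x < L")
    z = x / contour_length
    return (kbt / persistence_length) * (z + 0.25 / (1.0 - z) ** 2 - 0.25)


def wlc_extension(force, contour_length, persistence_length, kbt=KBT_PN_NM):
    """Invert wlc_force by bisection; the force grows monotonically with x."""
    lo, hi = 0.0, contour_length * (1.0 - 1e-12)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if wlc_force(mid, contour_length, persistence_length, kbt) < force:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With the hypothetical values L = 100 nm and $l_p$ = 0.4 nm, `wlc_force(50.0, 100.0, 0.4)`
evaluates the formula at x/L = 0.5. In practice $l_p$ and L would be obtained by fitting
this expression to measured FECs.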

If f is held constant using the force-clamp method (89, 37, 134), x(t) exhibits discrete
jumps among accessible basins of attraction as a function of time. From a long
time-dependent trajectory x(t) the transition rates between the populated basins
can be directly calculated. If the time traces are "sufficiently" long to ensure
that the protein ergodically samples the accessible conformations, an equilibrium
f-dependent free energy profile (F(x)) can be constructed (137, 61).
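
The two analyses described above can be sketched as follows, assuming an evenly sampled
extension trace and, for the rates, only two populated basins separated by a
user-chosen threshold. The profile is obtained from the histogram of x as
$F(x) = -k_B T \ln P(x)$. The function names, the binning, and the threshold rule are
hypothetical choices made for illustration.

```python
import math
from collections import Counter


def free_energy_profile(extensions, bin_width, kbt=1.0):
    """Return {bin_center: F} with F = -kBT ln P(x), shifted so that min F = 0."""
    counts = Counter(math.floor(x / bin_width) for x in extensions)
    total = sum(counts.values())
    profile = {
        (b + 0.5) * bin_width: -kbt * math.log(n / total)
        for b, n in counts.items()
    }
    f_min = min(profile.values())
    return {x: f - f_min for x, f in sorted(profile.items())}


def hopping_rates(extensions, threshold, dt):
    """Two-state rates: number of crossings divided by time spent in the basin."""
    states = [x > threshold for x in extensions]
    time_low = states.count(False) * dt
    time_high = states.count(True) * dt
    if time_low == 0 or time_high == 0:
        raise ValueError("trajectory never visits one of the two basins")
    pairs = list(zip(states, states[1:]))
    n_up = sum(1 for a, b in pairs if not a and b)      # low -> high jumps
    n_down = sum(1 for a, b in pairs if a and not b)    # high -> low jumps
    return n_up / time_low, n_down / time_high
```

Both estimates are meaningful only if the trace is long enough to sample each basin
many times, which is the ergodicity condition stated in the text.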

6.1 Transition State Location and Hammond Behavior

If $r_f$ is a constant, the force required to unfold proteins varies stochastically, which
implies that the distribution of rupture forces (values of f at which the NBA → stretched
transition occurs), P(f), can be constructed using multiple measurements. If unfolding
is described by the Bell equation (unfolding rate $k(f) = k(f=0) \exp(f \Delta x_{TS}/k_B T)$,
where $\Delta x_{TS}$ is the location of the TS with respect to the NBA), then
$\Delta x_{TS}$ can be estimated using $f^* \sim (k_B T/\Delta x_{TS}) \log r_f$, where $f^*$
is the most probable rupture force. When the response of proteins over a
large range of $r_f$ is examined, the [$\log r_f$, $f^*$] curve is non-linear, which is due
to the dependence of $\Delta x_{TS}$ on $r_f$ (32, 33, 34, 63) or the presence of multiple
free energy barriers (94). For proteins ($r_f \sim 100 - 1000$ pN/s), the value of
$\Delta x_{TS}$ is in the range of 2 - 7 Å depending on $r_f$ (113, 25).
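
The estimate of $\Delta x_{TS}$ from the relation $f^* \sim (k_B T/\Delta x_{TS}) \log r_f$
can be sketched as a straight-line fit of $f^*$ against $\log r_f$, whose slope is
$k_B T/\Delta x_{TS}$. The sketch below assumes the natural logarithm, forces in pN,
$k_B T \approx 4.11$ pN nm, and a hypothetical function name. It presumes a single
constant $\Delta x_{TS}$, which, as noted above, fails when the plot is curved.

```python
import math

KBT_PN_NM = 4.11  # assumed k_B*T in pN*nm near room temperature


def estimate_dx_ts(loading_rates, rupture_forces, kbt=KBT_PN_NM):
    """Least-squares slope of f* versus ln(r_f); returns dx_TS = kBT / slope in nm."""
    xs = [math.log(r) for r in loading_rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(rupture_forces) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rupture_forces))
    slope = sxy / sxx  # pN per unit of ln(r_f)
    return kbt / slope
```

Fitting separate windows of $r_f$ with this function and comparing the resulting values
is one simple way to see the loading-rate dependence of $\Delta x_{TS}$ discussed in the text.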



The TS movement as f or $r_f$ increases can be explained using the Hammond
postulate, which states that the TS resembles the least stable species along the
folding reaction (52). The stability of the NBA decreases as f increases, which
implies that $\Delta x_{TS}$ should decrease as f is increased (63). For soft molecules such
as proteins and RNA, $\Delta x_{TS}$ always decreases with increasing $r_f$ and f. The positive
curvature in the [$\log r_f$, $f^*$] plot is the signature of the classical Hammond behavior
(64).



6.2 Roughness of the Energy Landscape 

Hyeon and Thirumalai (HT) showed theoretically that if T is varied in SMFS studies
then the f-dependent unfolding rate is given by
$\log k(f,T) = a + b/T - \epsilon^2/(k_B T)^2$ (62, 63), where $\epsilon$ is the roughness
of the energy landscape. From the temperature dependence of $k(f,T)$ (or $k(r_f,T)$) the
values of $\epsilon$ for several systems have been extracted (99, 113, 66). Nevo et al.
measured $\epsilon$ for a protein complex consisting of the nuclear receptor importin-β
(imp-β) and the Ras-like GTPase Ran that is loaded with a non-hydrolysable GTP analogue.
The values of $f^*$ at three temperatures (7, 20, 32 °C) were used to obtain
$\epsilon \approx (5 - 6) k_B T$ (99). Recently, Schlierf and Rief (SR) (113) analyzed the
unfolding force distribution (with $r_f$ fixed) of a single domain of Dictyostelium
discoideum filamin (ddFLN4) at five different temperatures to infer the underlying one
dimensional free energy surface. By adopting the HT theory (62), SR showed that the data
can be fitted using $\epsilon = 4 k_B T$ for ddFLN4 unfolding.
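
To illustrate how $\epsilon$ follows from the HT expression, note that $\log k$ is
quadratic in $1/T$ with curvature $-\epsilon^2/k_B^2$. A minimal sketch, assuming rates
measured at exactly three temperatures (in Kelvin) at fixed f, extracts that curvature
from divided differences. The function name, the units of $k_B$, and the three-point
shortcut are assumptions made for illustration. The experiments cited above worked
with $f^*$ and force distributions rather than with this direct procedure.

```python
import math

K_B = 0.0019872  # Boltzmann constant in kcal/(mol K), an assumed unit choice


def roughness_from_three_temperatures(temps, log_rates, k_b=K_B):
    """Solve log k = a + b*u + c*u**2 (u = 1/T) exactly; return eps = k_B*sqrt(-c)."""
    u0, u1, u2 = (1.0 / t for t in temps)
    y0, y1, y2 = log_rates
    d01 = (y1 - y0) / (u1 - u0)      # first divided differences
    d12 = (y2 - y1) / (u2 - u1)
    c = (d12 - d01) / (u2 - u0)      # coefficient of u**2, equal to -(eps/k_B)**2
    if c > 0:
        raise ValueError("positive curvature: data are inconsistent with the model")
    return k_b * math.sqrt(-c)
```

With more than three temperatures, a least-squares quadratic fit in $1/T$ would replace
the exact three-point solution.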

6.3 Unfolding Pathways from FECs

The FECs can be used to obtain the unfolding pathways. From the FEC alone it is
only possible to provide a global picture of f-induced unfolding. Two examples,
one (GFP) for which predictions preceded experiments and the other (RNase-H),
illustrate the differing responses to force.

6.4 RNase-H under Tension 

Ensemble experiments had shown that RNase-H, a 155-residue protein, folds
through an intermediate (I) that may be either on- or off-pathway (111, 6). The
FEC obtained from LOT experiments (18) showed that there is one rip in the
unfolding at f ≈ 15 - 20 pN corresponding to the NBA → U transition (see Fig. 6).
Upon decreasing f there is a signature of I in the FEC corresponding to a partial
contraction in length at f ≈ 5.5 pN, the midpoint at which U and I are equally
populated. The absence of the intermediate in the unfolding FEC is
due to the shape of the energy landscape. Once the first barrier, which is significantly
larger than the mechanical stability of the I state relative to U, is crossed, global
unfolding occurs in a single step. In the refolding process, the I state is reached
from U since the free energy barrier between I and U is relatively small. The pathway
inferred from the FECs is also supported by the force-clamp method. Even when
f is maintained at f = 5.5 pN, the molecule can occasionally reach the N state by
jumping over the barrier between N and I, which is accompanied by an additional
contraction in the extension. However, once the N state is reached, RNase-H has
little chance to hop back to I within the observable time. Because in the majority of
cases transitions out of the NBA cease after the I → N transition, it was surmised that
I must be on-pathway.

6.5 Pathway Bifurcation in the Forced-Unfolding of Green Fluorescent
Protein (GFP)

The nearly 250 residue Green Fluorescent Protein (GFP) has a barrel shaped
structure consisting of 11 β-strands with one α-helix at the N-terminus. The mechanical
response of GFP, which depends both on the loading rate and the stretching direction
(96, 24), is intricate. The unfolding FEC for GFP inferred from the first series of
AFM experiments showed clearly well populated intermediates, which is in sharp
contrast to RNase-H. The assignment of the intermediates associated with the
peaks in the FECs was obscured by the complex architecture of GFP. In the original
studies (24) it was suggested that unfolding occurs sequentially with the single
pathway being N → [GFPΔα] → [GFPΔαΔβ] → U, where Δα and Δβ denote
rupture of the α-helix and a β-strand (Fig. 7) from the N-terminus (25). After the
α-helix is disrupted, the second rip is observed due to the unraveling of β1 or β11,
both of which have an identical number of residues. A much richer and more complex
landscape was predicted using Self-Organized Polymer (SOP) model simulations
performed at the loading rate used in AFM experiments (59). The simulations
predicted that after the formation of [GFPΔα] there is a bifurcation in the unfolding
pathways. In the majority of cases the route to the U state involves population
of two additional intermediates, [GFPΔβ1] (Δβ1 is the N-terminal β-strand) and
[GFPΔαΔβ1Δβ2Δβ3]. The most striking prediction of the simulations is that in the
alternative pathway there is only one intermediate after [GFPΔα],
N → [GFPΔα] → [GFPΔαΔβ11] → U (59). The predictions, along with the estimate of the
magnitude of forces, were quantitatively validated using SMFS experiments (96).



6.6 Refolding upon Force-Quench 

Two novel ways of initiating refolding using mechanical force have been reported.
In the first case a large constant force was applied to poly-ubiquitin (poly-Ub) to
prepare a fully extended ensemble. These experiments (Fig. 8a), which were the first
to use a $f_S \to f_Q$ jump to trigger folding, provided insights into the folding process
that are in broad agreement with theoretical predictions. (i) The time dependent
changes in x(t), following a $f_S \to f_Q$ quench, occur in at least three distinct stages.
There is a rapid initial reduction in x(t), followed by a long plateau in which x(t)
is roughly constant. The acquisition of the native structure in the last stage,
which involves two phases, occurs in a cooperative process. (ii) There are large
molecule-to-molecule variations in the dynamics of x(t) (76). (iii) The time scales
for collapse and folding are strongly dependent on $f_Q$ for a fixed $f_S$. Both
$\tau_F(f_Q)$ and the $f_Q$-dependent collapse time increase as $f_Q$ increases. The value of
$\tau_F(f_Q)$ can be nearly an order of magnitude greater than the value at $f_Q = 0$.

The interpretation of the force-quench folding trajectories is found by examining
(Fig. 8b) the nature of the initial structural ensemble (41, 87, 60). The initial
structural ensemble for the bulk measurement is the thermally denatured ensemble (TDE),
while the initial structural ensemble under high tension is the fully stretched
force denatured ensemble (FDE). Upon force quench a given molecule goes from a small
entropy state (FDE) to an ensemble with increased entropy and then to the low entropy
folded state (NBA) (Fig. 8b). Therefore, it is not surprising that the folding kinetics
upon force quench are vastly different from those in the bulk measurements.

The folding rate upon force-quench is slow relative to bulk measurements. A
comprehensive theory of the generic features of x(t) relaxation and sequence-specific
effects for folding upon force quench showed that refolding pathways and
$f_Q$-dependent folding times are determined by an interplay of $\tau_F(f_Q)$ and the time
scale, $\tau_Q$, in which the $f_S \to f_Q$ quench is achieved (60). If $\tau_Q$ is small then the
molecule is trapped in force-induced metastable intermediates (FIMIs) that are
separated from the NBA by a free energy barrier. The formation of FIMIs is
generic to the force-quench refolding dynamics of any biopolymer. Interestingly,
the formation of DNA toroids under tension, revealed using optical tweezers
experiments, is extremely slow (~1 hour at $f_Q \approx 1$ pN).

6.7 Force Correlation Spectroscopy (FCS) 

The relevant structures that guide folding from the stretched state may be inferred
using Force Correlation Spectroscopy (FCS). In such experiments the duration
$\Delta t$ in which $f_Q$ is held constant (to initiate folding) is varied (Fig. 9a). If
$\Delta t/\tau_F(f_Q) \gg 1$ then it corresponds to the situation probed in (37), whereas in the
opposite limit folding is disrupted. Thus, by cycling between $f_S$ and $f_Q$, and varying
the time spent at $f_Q$, the nature of the collapsed conformations can be unambiguously
discerned. The theoretical suggestion was implemented in a remarkable experiment
by Fernandez and coworkers using poly-Ub (45). By varying $\Delta t$ from about 0.5 s
to 15 s, they found that the increase in the extension upon the $f_Q \to f_S$ jump could
be described using a sum of two exponential functions (Fig. 9b). The rate of the
fast phase, which amounts to disruption of collapsed structures, is 40 times greater
than that of the slow phase, which corresponds to unfolding of the native structure. The
ensemble of mechanically weak structures that form on a ms time scale corresponds
to the theoretically predicted MECS. The experiments also verified that MECS
are separated from the NBA by free energy barriers. The single molecule force clamp
experiments have unambiguously shown that folding occurs by a three stage
multipathway approach to the NBA. Such experiments are difficult to perform by
triggering folding by dilution of denaturants because $R_g$ of the DSE is not
significantly larger than that of the native state. Consequently, the formation of MECS
is far too rapid to be detected. The use of f increases these times, making the detection
of MECS easier.
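
The two-exponential description of the extension increase can be written down
explicitly. The sketch below assumes a saturating form in which each phase contributes
$A(1 - e^{-kt})$. The function name, the amplitudes, and the absolute rates are
hypothetical, and only the 40-fold ratio between the fast and slow rates comes from the
text.

```python
import math


def extension_recovery(t, amp_fast, k_fast, amp_slow, k_slow):
    """Extension gained a time t after the f_Q -> f_S jump (sum of two exponentials)."""
    fast = amp_fast * (1.0 - math.exp(-k_fast * t))   # disruption of collapsed structures
    slow = amp_slow * (1.0 - math.exp(-k_slow * t))   # unfolding of native structure
    return fast + slow


# Hypothetical parameters: normalized amplitudes and a fast rate 40 times the slow rate.
K_SLOW = 0.1            # 1/s, assumed
K_FAST = 40.0 * K_SLOW  # 1/s, ratio taken from the text
```

Under these assumptions, the relative amplitude of the fast phase as a function of
$\Delta t$ would report on how much of the population is still in mechanically weak
structures when the stretching force is reapplied.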

7 CONCLUSIONS 

The statistical mechanical perspective and advances in experimental techniques
have revolutionized our view of how simple single-domain proteins fold. What a
short while ago seemed to be mere concepts are starting to be realized experimentally
thanks to the ability to interrogate the folding routes one protein molecule at
a time. In particular, the use of force literally allows us to place a single protein
at any point in the multidimensional free energy surface and watch it fold. With
advances in theory and simulations, it appears that we have entered an era in which
detailed comparisons between predictions and experiments can be made. Computational
methods have even been able to predict the conformations explored by
interacting proteins, with the recent story of the Rop dimer being a good example
(44). The promise that all atom simulations can be used to fold at least small
proteins, provided the force-fields are reliable, will lead to an unprecedented movie of
the folding process that will also include the role water plays in guiding the protein
to the NBA.

Are the successes touted here and elsewhere cause for celebration or should they be
deemed "irrational exuberance"? It depends on what is meant by success. There
is no doubt that an edifice has been built to rationalize and in some instances even
predict the outcomes of experiments on how small (less than about 100 residue)
proteins fold. However, from the perspective of an expansive view of the protein
folding problem, advertised in the Introduction, much remains to be done. We are
far from being able to predict the sequence of events that drive the unfolded proteins
to the NBA without knowing the structure of the folded state. From this viewpoint
both structure prediction and folding kinetics are linked. Regardless of the level of
optimism (or pessimism) it is clear that the broad framework that has emerged by
intensely studying the protein folding problem will prove useful as we start tackling
more complex problems of cellular functions that involve communication between
a number of biomolecules. An example where this approach is already evident
is in the development of the Iterative Annealing Mechanism for describing the
function of the GroEL machine, which combines concepts from protein folding and
allosteric transitions that drive GroEL through a complex set of conformational
changes during a reaction cycle (132). Surely, the impact of the concepts developed
to understand protein folding will continue to grow in virtually all areas of biology.

8 SUMMARY POINTS 

1. Several properties of proteins, ranging from their size to folding cooperativity,
depend in a universal manner on the number (N) of amino acid residues. The
precise dependence of these properties on N can be predicted accurately
using polymer physics concepts.

2. Examination of the folding landscapes leads to a number of scenarios for
self-assembly. Folding of proteins with simple architecture can be described using the
nucleation-collapse mechanism with multiple folding nuclei, while those with complex
folds reach the Native Basin of Attraction (NBA) by the Kinetic Partitioning
Mechanism.

3. The time scales for reaching the NBA, which occurs in three stages depending
on the protein fold, can be estimated in terms of N. The predictions are well
supported by experiments.

4. The Molecular Transfer Model, which combines simulations and the classical transfer
model, accurately predicts denaturant-dependent quantities measured in ensemble
and single molecule FRET experiments. In the process it is shown that the
melting temperatures are residue-dependent, which accords well with a number of
experiments.

5. The heterogeneity in the unfolding pathways, predicted theoretically, is revealed
in experiments that use mechanical force to trigger folding and unfolding. Studies
on GFP show the need to combine simulations and AFM experiments to map the
folding routes. A novel force protocol, proposed using theory, reveals the presence
of Minimum Energy Compact Structures predicted using simulations.

Figure Legends

Figure 1: (a) Dependence of $R_g$ on N. Data are taken from (80) and the line
is the fit to the Flory theory. (b) $R_g$ versus N (29).

Figure 2: (a) Temperature (in Centigrade) dependence of $f_{NBA}$ and its
derivative $|df_{NBA}/dT|$. (b) Plot of $\log(\Delta T/T_F)$ versus $\log N$. The linear fit
(solid line) to the experimental data for 32 proteins shows $\Delta T/T_F \sim N^{-\lambda}$
with $\lambda = 1.08 \pm 0.04$ and a correlation coefficient of 0.95 (88). (c) Plot of
$\log \Omega_c$ versus $\log N$. The solid line is a fit to the data with
$\zeta = 2.17 \pm 0.09$ (correlation coefficient is 0.95). Inset shows denaturation data.

Figure 3: (a) Schematic of the rugged folding landscape of proteins with energetic
and topological frustration. A fraction Φ of unfolded molecules follow the fast
track (white) to the NBA while the remaining fraction (1 - Φ) of slow trajectories
(green) are trapped in one of the CBAs. (b) Summary of the mechanisms by
which proteins reach their native state. The upper path is for fast track molecules.
Φ ~ 1 implies the folding landscape is funnel-like. The lower routes are for slow
folding trajectories (green in (a)). The number of conformations explored in the
three stages as a function of N are given below, with numerical estimates for
N = 27. The last line gives the time scale for the three processes for N = 100
using the estimates described in the text. (c) Multiple folding nuclei model for
folding of a lattice model with side chains with N = 15 (77). The probability of
forming the native contacts (20 in the native state shown in black) in the TSE
is given in purple. The average structure in the three major clusters in the TSE
are shown. There is a non-native contact in the most probable cluster (shown
in the middle). The native state is on the right. (d) Dependence of the folding
times on $\sqrt{N}$ for 69 proteins (adapted from (98)). Red line is a linear fit
(correlation coefficient is 0.74) and the blue circles are data.

Figure 4: (a) Diagram for MTM theory (Top). $E_i(A)$ ($E_i(B)$) is the energy of
the ith microstate in condition A (B), while Z(A) and Z(B) are the corresponding
partition functions. (b) Linear correlation between calculated (using the TM)
and measured m-values for proteins in urea. (c) Predictions using MTM
versus experiments (symbols). Protein L is in blue and red is for CspTm. (d)
Comparison of the predicted FRET efficiencies versus experiments for protein L.
MTM results for ⟨E⟩ of the native state (red line), DSE (blue line), and average
(black line) are shown. Experimental values for the ⟨E⟩ for the DSEs are in blue
solid squares (119) and open squares (93).

Figure 5: (a) Distribution of $R_{ee}/\langle R_{ee} \rangle$ for the DSE ensemble at 5, 7,
and 9 M GdmCl concentrations. Line is the universal curve for polymers in good solvent.
(b) Predicted values of the average $R_g$ (open circle) and $R_{ee}$ (x) as a function of
urea for protein L. The broken lines show the corresponding values for the DSE
as a function of [C]. (c) Histogram of $T_{m,i}$ values for 158 protons for BBL obtained
using NMR (taken from (112)). (d) Predicted $T_{m,i}$ values using MTM for protein
L. (e) Histogram of $C_{m,i}$ values for protein L. (f) Mean field interaction energy
for three proteins versus [C] (143).

Figure 6: Top left shows a schematic of the LOT set up used to generate FEC
and x(t) for RNase-H. The curves below are unfolding FECs. The refolding FEC
on the right shows the U → I transition. The right figure shows the proposed
folding landscape for the transition from U to N through I. The folding trajectory
is superimposed on top of the folding landscape. Figure adapted from (18).

Figure 7: Folding landscape for GFP obtained using SOP simulations and
AFM experiments. Top and bottom left show the folded structure and the
connectivity of secondary structural elements. The right shows the bifurcation in the
pathways from the NBA to the stretched state.

Figure 8: (a) Force quench refolding trajectory of poly-Ub generated by AFM
(from (37)). The black curve shows contraction in x(t) after fully stretching
poly-Ub. (b) Sketch of the folding mechanism of a polypeptide chain upon $f_S \to f_Q$
quench. Rapid quench generates a plateau in x(t) (FIMI) followed by exploration
of MECS prior to reaching the NBA. Chain entropy goes from a small value
(stretched state) to large value (compact conformations) to a low value (NBA).

Figure 9: (a) Sketch of force pulse used in Force Correlation Spectroscopy
(FCS). Polypeptide chain is maintained at $f_Q$ for arbitrary times before stretching.
(b) Increase in extension of poly-Ub upon application of stretching force for
various $\Delta t$ values (45).